AITopics | data synthesis

data synthesis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

GRIP: A Graph-Based Reasoning Instruction Producer

Neural Information Processing Systems | Jun-18-2026, 07:47:37 GMT

Large-scale, high-quality data is essential for advancing the reasoning capabilities of large language models (LLMs). As publicly available Internet data becomes increasingly scarce, synthetic data has emerged as a crucial research direction. However, existing data synthesis methods often suffer from limited scalability, insufficient sample diversity, and a tendency to overfit to seed data, which constrains their practical utility.

large language model, machine learning, natural language, (20 more...)

Country: Asia > China (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Education > Educational Setting (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.95)

TreeSynth: Synthesizing Diverse Data from Scratch via Tree-Guided Subspace Partitioning

Neural Information Processing Systems | Jun-12-2026, 09:17:10 GMT

Model customization necessitates high-quality and diverse datasets, but acquiring such data remains time-consuming and labor-intensive. Despite the great potential of large language models (LLMs) for data synthesis, current approaches are constrained by limited seed data, model biases and low-variation prompts, resulting in limited diversity and biased distributions as data scale increases. To tackle this challenge, we introduce TreeSynth, a tree-guided subspace-based data synthesis approach inspired by decision trees. It constructs a spatial partitioning tree to recursively divide a task-specific full data space (i.e., root node) into numerous atomic subspaces (i.e., leaf nodes) with mutually exclusive and exhaustive attributes to ensure both distinctiveness and comprehensiveness, before synthesizing samples within each atomic subspace. This globally divide-and-synthesize method finally collects subspace samples into a comprehensive dataset, effectively circumventing repetition and space collapse to ensure the diversity of large-scale data synthesis. Furthermore, the spatial partitioning tree enables sample allocation into atomic subspaces, allowing the re-balancing of existing datasets for more balanced and comprehensive distributions. Empirically, extensive experiments across diverse benchmarks consistently validate the superior data diversity, model performance, and robust scalability of TreeSynth compared to both human-crafted datasets and peer data synthesis methods, with the average performance gain reaching 10%. In addition, the consistent improvements of TreeSynth-balanced datasets highlight its efficacious application to redistribute existing datasets for more comprehensive coverage and the induced performance enhancement.
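The divide-and-synthesize idea in the abstract above can be illustrated with a minimal Python sketch. It is not the TreeSynth implementation: the attribute list, the `Node` class, and the `template_generator` function are hypothetical stand-ins (an LLM would presumably propose attributes and write the samples), and the tree here simply splits on one fixed attribute per level. It only shows how mutually exclusive, exhaustive leaves lead to even coverage of the data space.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical attribute dimensions for a toy task; fixed here for illustration.
ATTRIBUTES: List[Tuple[str, List[str]]] = [
    ("topic", ["algebra", "geometry"]),
    ("difficulty", ["easy", "hard"]),
    ("format", ["word problem", "equation"]),
]


@dataclass
class Node:
    """A subspace of the data space, described by attribute constraints."""
    constraints: Dict[str, str]
    children: List["Node"] = field(default_factory=list)


def build_tree(constraints: Dict[str, str], depth: int = 0) -> Node:
    """Recursively split a subspace on the attribute assigned to this depth."""
    node = Node(dict(constraints))
    if depth < len(ATTRIBUTES):
        name, values = ATTRIBUTES[depth]
        # Each value yields one child, so siblings are mutually exclusive
        # and together cover the parent subspace.
        for value in values:
            node.children.append(build_tree({**constraints, name: value}, depth + 1))
    return node


def leaves(node: Node) -> List[Node]:
    """Collect the atomic subspaces (leaf nodes) under a node."""
    if not node.children:
        return [node]
    found: List[Node] = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def template_generator(constraints: Dict[str, str], index: int) -> str:
    """Placeholder for an LLM call: returns a prompt-like string per sample."""
    description = ", ".join(f"{k}={v}" for k, v in constraints.items())
    return f"[sample {index}] Write a task with {description}"


def synthesize(leaf: Node, n: int, generate: Callable[[Dict[str, str], int], str]) -> List[str]:
    """Generate n samples inside one atomic subspace."""
    return [generate(leaf.constraints, i) for i in range(n)]


root = build_tree({})  # the root node is the full, unconstrained data space
dataset = [s for leaf in leaves(root) for s in synthesize(leaf, 2, template_generator)]
```

Because every leaf receives the same sample budget, no region of the attribute grid is over- or under-represented. The same tree could in principle be used to assign existing samples to leaves and re-balance them, as the abstract describes.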

large language model, machine learning, natural language, (11 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.75)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.59)

Elucidating the Design Space of Dataset Condensation

Neural Information Processing Systems | Feb-17-2026, 14:30:15 GMT

Elucidating Dataset Condensation (EDC) establishes a benchmark for both small and large-scale dataset condensation.

artificial intelligence, machine learning, natural language, (18 more...)

Country:

Europe > Austria > Vienna (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > United States > Utah > Salt Lake County > Salt Lake City (0.04)
(8 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning

Neural Information Processing Systems | Feb-17-2026, 03:58:37 GMT

The high-capacity generative models trained on large, diverse datasets have demonstrated remarkable success across vision and language tasks.

large language model, machine learning, mtdiff, (16 more...)

Country:

Asia > China > Shanghai > Shanghai (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
Europe > France > Hauts-de-France > Nord > Lille (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

983876577ec81db17ecfae1521df9208-Paper-Conference.pdf

Neural Information Processing Systems | Feb-16-2026, 20:29:51 GMT

data mining, machine learning, natural language, (21 more...)

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California (0.05)
North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
(3 more...)

Genre: Research Report > Experimental Study (1.00)

Industry:

Banking & Finance (1.00)
Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.93)
(5 more...)

22456f4b545572855c766df5eefc9832-Paper.pdf

Neural Information Processing Systems | Feb-7-2026, 20:26:13 GMT

generator, it-gan, synthesis, (15 more...)

Country: North America > United States (0.14)

Genre: Research Report (0.46)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)

Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning

Neural Information Processing Systems | Dec-26-2025, 19:25:53 GMT

Diffusion models have demonstrated highly expressive generative capabilities in vision and NLP. Recent studies in reinforcement learning (RL) have shown that diffusion models are also powerful in modeling complex policies or trajectories in offline datasets. However, these works have been limited to single-task settings where a generalist agent capable of addressing multi-task predicaments is absent. In this paper, we aim to investigate the effectiveness of a single diffusion model in modeling large-scale multi-task offline data, which can be challenging due to diverse and multimodal data distribution. Specifically, we propose Multi-Task Diffusion Model (MTDiff), a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis in multi-task offline settings.

diffusion model, effective planner and data synthesizer, multi-task reinforcement learning, (6 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.30)

HARMONIC: Harnessing LLMs for Tabular Data Synthesis and Privacy Protection

Neural Information Processing Systems | Nov-20-2025, 02:33:19 GMT

Data serves as the fundamental basis for advancing deep learning.

data mining, large language model, machine learning, (17 more...)

Country:

Asia > China > Sichuan Province > Chengdu (0.04)
Asia > Singapore (0.04)
Asia > China > Hubei Province > Wuhan (0.04)
(3 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)
Banking & Finance (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(3 more...)

Beyond One-Size-Fits-All: Neural Networks for Differentially Private Tabular Data Synthesis

Chen, Kai, Gong, Chen, Wang, Tianhao

arXiv.org Artificial Intelligence | Nov-19-2025

In differentially private (DP) tabular data synthesis, the consensus is that statistical models are better than neural network (NN)-based methods. However, we argue that this conclusion is incomplete and overlooks the challenge of densely correlated datasets, where intricate dependencies can overwhelm statistical models. In such complex scenarios, neural networks are more suitable due to their capacity to fit complex distributions by learning directly from samples. Despite this potential, existing NN-based algorithms still suffer from significant limitations. We therefore propose MargNet, incorporating successful algorithmic designs of statistical models into neural networks. MargNet applies an adaptive marginal selection strategy and trains the neural networks to generate data that conforms to the selected marginals. On sparsely correlated datasets, our approach achieves utility close to the best statistical method while offering an average 7× speedup over it. More importantly, on densely correlated datasets, MargNet establishes a new state-of-the-art, reducing fidelity error by up to 26% compared to the previous best. We release our code on GitHub: https://github.com/KaiChen9909/margnet
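To make "data that conforms to the selected marginals" concrete, here is a small Python sketch of what a marginal is and how the mismatch between a real and a synthetic table could be measured on one. It is not MargNet and adds no differential-privacy noise; the helper names `marginal` and `tv_distance`, the toy rows, and the choice of total variation distance as the error measure are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

Row = Tuple[str, ...]


def marginal(rows: Sequence[Row], cols: Sequence[int]) -> Dict[Tuple[str, ...], float]:
    """Empirical joint distribution of the chosen columns (a k-way marginal)."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    total = len(rows)
    return {key: count / total for key, count in counts.items()}


def tv_distance(p: Dict[Tuple[str, ...], float], q: Dict[Tuple[str, ...], float]) -> float:
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Toy tables with two categorical columns (hypothetical data).
real = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
synthetic = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]

# The 1-way marginal on column 0 matches exactly ...
one_way_error = tv_distance(marginal(real, [0]), marginal(synthetic, [0]))
# ... but the 2-way marginal exposes a correlation the real data does not have.
two_way_error = tv_distance(marginal(real, [0, 1]), marginal(synthetic, [0, 1]))
```

The toy example shows why the choice of marginals matters: a synthesizer that only matches low-order marginals can still misrepresent dependencies between columns, which is the kind of situation the abstract associates with densely correlated datasets.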

artificial intelligence, dataset, machine learning, (19 more...)

2511.13893

Genre: Research Report > New Finding (0.93)

Industry: Information Technology > Security & Privacy (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Democratizing Tabular Data Access with an Open-Source Synthetic-Data SDK

Krchova, Ivona, Vieyra, Mariana Vargas, Scriminaci, Mario, Sidorenko, Andrey

arXiv.org Artificial Intelligence | Nov-14-2025

Machine learning development critically depends on access to high-quality data. However, increasing restrictions due to privacy, proprietary interests, and ethical concerns have created significant barriers to data accessibility. Synthetic data offers a viable solution by enabling safe, broad data usage without compromising sensitive information. This paper presents the MOSTLY AI Synthetic Data Software Development Kit (SDK), an open-source toolkit designed specifically for synthesizing high-quality tabular data. The SDK integrates robust features such as differential privacy guarantees, fairness-aware data generation, and automated quality assurance into a flexible and accessible Python interface. Leveraging the TabularARGN autoregressive framework, the SDK supports diverse data types and complex multi-table and sequential datasets, delivering competitive performance with notable improvements in speed and usability. Currently deployed both as a cloud service and locally installable software, the SDK has seen rapid adoption, highlighting its practicality in addressing real-world data bottlenecks and promoting widespread data democratization.

The development of Machine Learning applications requires broad access to training data. This necessity has become more critical in recent years with the advent of Deep Learning, which requires large-scale datasets to effectively train models.

artificial intelligence, deep learning, machine learning, (18 more...)

2508.00718

Genre: Research Report (0.83)

Industry: Information Technology > Security & Privacy (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)
